add refashion dag #11
base: main
A few comments to discuss!
data = response.json()
all_data.extend(data['results'])
url = data.get('next', None)
print(url)
We can remove this print.
df_pds = pd.DataFrame(rows_list)
df_pds.index = range(idx_max, idx_max + len(df_pds))
df_pds['id'] = df_pds.index
This solution seems shaky to me: if even a single propositionservice is created through the interface, this DAG will crash.
I'm not sure we need to create an id column at all, since it is auto-incremented.
To be tested.
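One way to test the auto-increment idea is to insert rows without an id and let the database assign it. A minimal sketch, using the standard-library `sqlite3` with an in-memory database as a stand-in for the real one; the table name `propositionservice_demo` and its columns are hypothetical:

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the real one (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE propositionservice_demo ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, acteur_id TEXT NOT NULL)"
)

# A row created elsewhere (e.g. through the interface) before the DAG runs.
conn.execute("INSERT INTO propositionservice_demo (acteur_id) VALUES ('manual')")

# The DAG inserts its rows without an id column: the database picks the ids,
# so rows created in the meantime cannot cause a collision.
rows_list = [("refashion_1",), ("refashion_2",)]
conn.executemany(
    "INSERT INTO propositionservice_demo (acteur_id) VALUES (?)", rows_list
)
conn.commit()

inserted = conn.execute(
    "SELECT id, acteur_id FROM propositionservice_demo ORDER BY id"
).fetchall()
```

With pandas, the equivalent would presumably be to call `to_sql` with `if_exists='append'` and `index=False` on a DataFrame that has no `id` column, but that remains to be checked against the real schema.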
df_acteurtype = pd.read_sql_table('qfdmo_acteurtype', engine)
df_sources = pd.read_sql_table('qfdmo_source', engine)
df_ps = pd.read_sql_table('qfdmo_propositionservice', engine)
read_sql_table seems to load the whole table into a DataFrame.
Loading the entire table just to find the max id seems a bit overkill?
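If the max id is still needed, it could be fetched with a single aggregate query instead of loading the table. A rough sketch, again with an in-memory `sqlite3` database standing in for the real one; `propositionservice_demo` and `fetch_max_id` are hypothetical names:

```python
import sqlite3

# In-memory stand-in for the real database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE propositionservice_demo (id INTEGER PRIMARY KEY, acteur_id TEXT)"
)
conn.executemany(
    "INSERT INTO propositionservice_demo (id, acteur_id) VALUES (?, ?)",
    [(1, "a"), (2, "b"), (7, "c")],
)


def fetch_max_id(connection, table):
    """Return the highest id in `table`, or 0 when the table is empty.

    The table name is interpolated into the SQL, so only pass a trusted constant.
    """
    row = connection.execute(
        f"SELECT COALESCE(MAX(id), 0) FROM {table}"
    ).fetchone()
    return row[0]


idx_max = fetch_max_id(conn, "propositionservice_demo")
```

Only one value travels over the connection, whatever the size of the table. With pandas and the existing `engine`, `pd.read_sql_query` with a similar `SELECT MAX(id)` statement should give the same result as a one-row DataFrame.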
def transform_ecoorganisme(value, df_sources):
    id_value = df_sources.loc[df_sources['nom'].str.lower() == value.lower(), 'id'].values[0] if any(
        df_sources['nom'].str.lower() == value.lower()) else None
Wouldn't it be better to raise an exception and/or skip the row if the éco-organisme is not found in the list of sources?
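To make both options concrete, here is a small sketch in plain Python (no pandas, so that it runs on its own). All names are hypothetical, and the sample sources are made up for illustration:

```python
class UnknownEcoOrganismeError(ValueError):
    """Raised when an éco-organisme name has no matching source."""


def build_source_index(sources):
    """Map lowercased source names to ids; `sources` is a list of (id, nom)."""
    return {nom.lower(): source_id for source_id, nom in sources}


def transform_ecoorganisme_strict(value, source_index):
    """Return the source id for `value`, or raise if it is unknown."""
    try:
        return source_index[value.lower()]
    except KeyError:
        raise UnknownEcoOrganismeError(
            f"No source found for éco-organisme {value!r}"
        ) from None


def resolve_all(values, source_index):
    """Resolve every value, collecting the unknown ones instead of failing."""
    resolved, skipped = [], []
    for value in values:
        try:
            resolved.append(transform_ecoorganisme_strict(value, source_index))
        except UnknownEcoOrganismeError:
            skipped.append(value)
    return resolved, skipped


# Made-up sample data.
source_index = build_source_index([(1, "Refashion"), (2, "Other source")])
resolved, skipped = resolve_all(["REFASHION", "unknown"], source_index)
```

Raising by default and catching at the call site leaves the choice (fail the run, or skip and log the row) to the DAG rather than silently writing `None`.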
sous_categories = {
    "Vêtement": 107,
    "Linge": 104,
    "Chaussure": 109
Note for later: we could extract these mappings into a configuration file.
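As an illustration of what that could look like, here is a sketch that reads the mapping from JSON. The JSON layout and the `load_mappings` helper are assumptions; the values are the ones from the diff. The text is inlined here so the example runs on its own, whereas a real setup would read it from a file:

```python
import json

# Content that would live in a configuration file (layout is an assumption).
CONFIG_TEXT = """
{
  "sous_categories": {"Vêtement": 107, "Linge": 104, "Chaussure": 109}
}
"""


def load_mappings(text):
    """Parse the JSON configuration and return it as a dict."""
    return json.loads(text)


mappings = load_mappings(CONFIG_TEXT)
sous_categories = mappings["sous_categories"]
```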
else:
    df[new_col] = df[old_col]
df['label_reparacteur'] = False
df['identifiant_unique'] = df.apply(lambda x: generate_unique_id(x, selected_columns=selected_columns), axis=1)
To avoid depending on the columns that make up this id, I suggest using the following as the identifier:
SOURCE_IDEXTERNE (with a _d suffix if it is digital)
We still need to check whether this format is enough to get unique IDs across the whole file.
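A possible shape for this, together with the uniqueness check. This is only a sketch: the function names, the field names and the sample rows are hypothetical:

```python
from collections import Counter


def build_identifiant_unique(source, id_externe, is_digital=False):
    """Build SOURCE_IDEXTERNE, with a _d suffix for digital entries."""
    identifiant = f"{source}_{id_externe}"
    return f"{identifiant}_d" if is_digital else identifiant


def find_duplicates(identifiants):
    """Return the identifiers that appear more than once, sorted."""
    counts = Counter(identifiants)
    return sorted(ident for ident, count in counts.items() if count > 1)


# Made-up sample rows: the same external id in physical and digital form.
rows = [
    {"source": "refashion", "id_externe": "A1", "digital": False},
    {"source": "refashion", "id_externe": "A1", "digital": True},
    {"source": "refashion", "id_externe": "B2", "digital": False},
]
ids = [
    build_identifiant_unique(r["source"], r["id_externe"], r["digital"])
    for r in rows
]
duplicates = find_duplicates(ids)
```

Running `find_duplicates` on the ids built from the real file would answer the uniqueness question directly: an empty list means the format is sufficient.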
return pd.Series([address, postal_code, city])

def transform_location(longitude, latitude):
All the transformation functions could be unit tested.
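For instance, tests could look like the sketch below. The body of `transform_location` is not visible in this hunk, so the implementation here is a hypothetical stand-in, only there to make the tests runnable. The tests are plain functions with `assert`, which pytest can collect as-is:

```python
def transform_location(longitude, latitude):
    """Hypothetical stand-in: return (lon, lat) as floats, or None if invalid."""
    try:
        lon, lat = float(longitude), float(latitude)
    except (TypeError, ValueError):
        return None
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        return None
    return (lon, lat)


def test_valid_coordinates_are_converted():
    assert transform_location("2.35", "48.85") == (2.35, 48.85)


def test_missing_value_returns_none():
    assert transform_location(None, "48.85") is None


def test_out_of_range_returns_none():
    assert transform_location("200", "48.85") is None
```

Since each transformation function takes plain values and returns plain values, tests like these need neither a database nor Airflow.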
sous_categories = {
    "Vêtement": 107,
    "Linge": 104,
    "Chaussure": 109
Watch out for hard-coded ids: nothing guarantees that the ids are the same across environments.
Here they happen to match because we frequently copy prod to preprod, but it is preferable to rely on a "code/name".
There may be some rationalization of the DB to do here: to discuss together.
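To illustrate the "code/name" approach, the ids could be resolved from the database at run time. A minimal sketch, with an in-memory `sqlite3` database as a stand-in; the table `sous_categorie_demo`, its columns and the helper name are hypothetical:

```python
import sqlite3

# In-memory stand-in for the real database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sous_categorie_demo (id INTEGER PRIMARY KEY, nom TEXT UNIQUE)"
)
conn.executemany(
    "INSERT INTO sous_categorie_demo (id, nom) VALUES (?, ?)",
    [(107, "Vêtement"), (104, "Linge"), (109, "Chaussure")],
)


def load_ids_by_name(connection, names):
    """Return {nom: id} for the given names; fail loudly if one is missing."""
    placeholders = ", ".join("?" for _ in names)
    rows = connection.execute(
        f"SELECT nom, id FROM sous_categorie_demo WHERE nom IN ({placeholders})",
        list(names),
    ).fetchall()
    found = dict(rows)
    missing = sorted(set(names) - set(found))
    if missing:
        raise LookupError(f"Unknown sous-catégories: {missing}")
    return found


sous_categories = load_ids_by_name(conn, ["Vêtement", "Linge", "Chaussure"])
```

The code then depends only on the names, and a missing name fails early instead of writing a wrong id in another environment.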
No description provided.